Statistical computer assisted translation
Abstract
In recent years, statistical machine translation (MT) has improved significantly, but even the best MT technology is still far from replacing, or even competing with, human translators. An MT system can nevertheless increase the productivity of human translators. Usually, human translators edit the MT system output to correct its errors, or they edit the source text to limit its vocabulary. One way to increase the productivity of the whole translation process (MT plus human work) is to incorporate the human correction activities into the translation process itself, thereby shifting the MT paradigm to that of computer-assisted translation (CAT). In a CAT system, the human translator begins to type the translation of a given source text; with each typed character, the MT system interactively offers and refines a completion of the translation. The human translator may continue typing or accept all or part of the completion. Here, we use a fully fledged translation system, phrase-based MT, to develop computer-assisted translation systems. An important factor in a CAT system is the response time of the MT system. We describe an efficient search space representation using word hypotheses graphs that guarantees a fast response time. The experiments are carried out on a small and a large standard task. Skilled human translators are faster at dictating than at typing their translations, so a desirable feature of a CAT system is the integration of human speech. In a CAT system with integrated speech, two sources of information are available for recognizing the speech input: the target language speech and the given source language text. The target language speech is a human-produced translation of the source language text. The main challenge in integrating the automatic speech recognition (ASR) and MT models in a CAT system is the search.
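As a rough illustration of the interactive completion step described above, the following Python sketch offers a completion for whatever prefix the translator has typed so far. It is not the thesis implementation: a small scored N-best list stands in for the word graph, and the names `Hypothesis`, `complete`, and `nbest`, as well as all scores and sentences, are made up for this example.

```python
from typing import List, Optional, Tuple

# Hypothetical stand-in for a word graph: a scored list of full translations.
Hypothesis = Tuple[float, str]  # (log-probability, target sentence)


def complete(prefix: str, hypotheses: List[Hypothesis]) -> Optional[str]:
    """Return the remainder of the best-scoring hypothesis that starts with `prefix`.

    Returns None when no hypothesis is compatible with what the user typed.
    """
    compatible = [(score, text) for score, text in hypotheses if text.startswith(prefix)]
    if not compatible:
        return None
    _, best = max(compatible)  # highest log-probability wins
    return best[len(prefix):]


# Made-up MT output for one source sentence.
nbest = [
    (-1.2, "the house is small"),
    (-1.5, "the home is small"),
    (-2.3, "this house is little"),
]

print(complete("", nbest))         # completion offered before any typing
print(complete("the ho", nbest))   # two hypotheses still match the prefix
print(complete("the hom", nbest))  # the typed "m" rules out "house"
```

A real system would search the word graph rather than a flat list, which is what makes a fast response possible after every keystroke; the list here only keeps the example short.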
The search in the MT system and in the ASR system is already very complex, so a full single search that combines the ASR and MT models would considerably increase the complexity. In addition, a full single search becomes more complex because there is neither a specific model nor appropriate training data for it. In this work, we study different methods for integrating the ASR and MT models. We propose several new integration methods based on N-best list and word graph rescoring strategies. We study the integration of both single-word-based MT and phrase-based MT with ASR models. The experiments are performed on a standard large task, namely the European Parliament plenary sessions. A CAT system might be equipped with a memory-based module that does not actually translate, but finds the translation in a large database of exact or similar matches from sentences or phrases that are already known. Such databases, known as bilingual corpora, are also essential for training statistical machine translation models. Therefore, a larger database means a more accurate and faster translation system. In this thesis, we also investigate efficient ways to compile bilingual sentence-aligned corpora from the Internet. We propose two new methods for sentence alignment. The first is a typical extension of existing methods in the field of sentence alignment for parallel texts. We show how sentence-length-based models, word-to-word translation models, cognates, bilingual lexica, and any other features can be employed in an efficient way. The second is a new method for aligning sentences based on bipartite graph matching. We show that this new algorithm performs competitively with other methods on parallel corpora and, at the same time, is very useful for handling differences in sentence order between a source text and its translation.
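The bipartite matching idea can be illustrated with a small Python sketch. It treats source and target sentences as the two sides of a bipartite graph and picks the one-to-one matching with the lowest total cost, so sentences may appear in a different order on each side. Everything specific here is an assumption for illustration only: the cost is just the difference in character length, the matching is found by exhaustive search, and the names `length_cost` and `align` and the example sentences are hypothetical.

```python
from itertools import permutations
from typing import List, Tuple


def length_cost(src: str, tgt: str) -> int:
    """Toy edge cost: absolute difference in character length (an assumption)."""
    return abs(len(src) - len(tgt))


def align(src_sents: List[str], tgt_sents: List[str]) -> List[Tuple[int, int]]:
    """Minimum-cost one-to-one matching between two equally sized sentence lists.

    Exhaustive search over permutations; only practical for a handful of sentences.
    """
    if len(src_sents) != len(tgt_sents):
        raise ValueError("this sketch only handles 1-to-1 alignments")
    n = len(src_sents)
    best = min(
        permutations(range(n)),
        key=lambda perm: sum(
            length_cost(src_sents[i], tgt_sents[j]) for i, j in enumerate(perm)
        ),
    )
    return list(enumerate(best))  # pairs of (source index, target index)


# Made-up example: the target side lists the sentences in a different order.
source = [
    "Yes.",
    "The committee approved the annual budget yesterday.",
    "Thank you very much.",
]
target = [
    "Der Ausschuss hat gestern den Jahreshaushalt genehmigt.",
    "Vielen Dank.",
    "Ja.",
]

for i, j in align(source, target):
    print(source[i], "<->", target[j])
```

For realistic document sizes, the exhaustive search would be replaced by a polynomial-time assignment solver such as the Hungarian algorithm, and the edge cost would combine richer features such as the lexical and cognate scores mentioned above.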
Further, we propose an efficient way to recognize and filter out wrong sentence pairs from the bilingual corpora.
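The N-best list rescoring strategy mentioned in the abstract can be sketched in Python as follows. The sketch assumes a simple log-linear combination: each ASR hypothesis carries an ASR log-score and an MT log-score (how likely it is as a translation of the source text), and the two are added with weights. The function name `rescore`, the weights, and all scores are hypothetical values chosen for illustration, not results or models from the thesis.

```python
from typing import List, Tuple

# (hypothesis text, ASR log-score, MT log-score); all values are made up.
Scored = Tuple[str, float, float]


def rescore(
    nbest: List[Scored], asr_weight: float = 1.0, mt_weight: float = 1.0
) -> List[Tuple[float, str]]:
    """Rank hypotheses by a weighted sum of their ASR and MT log-scores."""
    combined = [
        (asr_weight * asr + mt_weight * mt, text) for text, asr, mt in nbest
    ]
    return sorted(combined, reverse=True)  # best combined score first


# The recognizer alone slightly prefers the wrong word "mall".
asr_nbest = [
    ("the house is mall", -4.0, -9.0),
    ("the house is small", -4.5, -2.0),
    ("a mouse is small", -6.0, -7.5),
]

ranked = rescore(asr_nbest)
print(ranked[0])                               # best after adding the MT score
print(rescore(asr_nbest, mt_weight=0.0)[0])    # best using the ASR score only
```

In this toy list, the MT score lets the source text correct a recognition error that the acoustic evidence alone would not resolve, which is the motivation for combining the two models.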
Similar resources
Automatic text dictation in computer-assisted translation
In this paper, we study the incorporation of statistical machine translation models to automatic speech recognition models in the framework of computer-assisted translation. The system is given a source language text to be translated and it shows the source text to the human translator to translate it orally. The system captures the user speech which is the dictation of the target language sent...
Online Learning Approaches in Computer Assisted Translation
We present a novel online learning approach for statistical machine translation tailored to the computer assisted translation scenario. With the introduction of a simple online feature, we are able to adapt the translation model on the fly to the corrections made by the translators. Additionally, we do online adaption of the feature weights with a large margin algorithm. Our results show that o...
Improving and Developing an English-to-Persian Computer-Assisted Translation System
In recent years, significant improvements have been achieved in statistical machine translation (SMT), but still even the best machine translation technology is far from replacing or even competing with human translators. Another way to increase the productivity of the translation process is computer-assisted translation (CAT) system. In a CAT system, the human translator begins to type the tra...
Collocational Translation Memory Extraction Based on Statistical and Linguistic Information
In this paper, we propose a new method for extracting bilingual collocations from a parallel corpus to provide phrasal translation memories. The method integrates statistical and linguistic information to achieve effective extraction of bilingual collocations. The linguistic information includes parts of speech, chunks, and clauses. The method involves first obtaining an extended list of Englis...
Statistical Phrase-Based Models for Interactive Computer-Assisted Translation
Obtaining high-quality machine translations is still a long way off. A postediting phase is required to improve the output of a machine translation system. An alternative is the so-called computer-assisted translation. In this framework, a human translator interacts with the system in order to obtain high-quality translations. A statistical phrase-based approach to computer-assisted translation ...
Interactive-predictive speech-enabled computer-assisted translation
In this paper, we study the incorporation of statistical machine translation models to automatic speech recognition models in the framework of computer-assisted translation. The system is given a source language text to be translated and it shows the source text to the human translator to translate it orally. The system captures the user speech which is the dictation of the target language sent...